Numpy Basics

pandas is built on top of numpy and relies on it for many of its operations. Understanding how numpy works will help you understand pandas.

Numpy Arrays

Numpy arrays are at the core of most data analysis in Python. They allow fast operations over an entire array at once.

The conventional way to import numpy, used by most resources, is import numpy as np.


In [133]:
import numpy as np

Creating Arrays

Create n-dimensional arrays with np.array.


In [169]:
array = np.array([1, 2, 3])
array


Out[169]:
array([1, 2, 3])

The type of an array created this way is always ndarray.


In [135]:
type(array)


Out[135]:
numpy.ndarray

Array Operations

For standard Python lists, a mathematical operator is not applied to each element of the list.

For example, adding two lists only concatenates them.


In [136]:
[1, 2] + [2, 3]


Out[136]:
[1, 2, 2, 3]

But numpy arrays do mathematical operations across each element of the array.


In [137]:
np.array([1,2]) + np.array([2,3])


Out[137]:
array([3, 5])
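The same element-wise behavior applies to the other arithmetic operators. The following is a minimal sketch; the names a, b, difference, product and scaled are only illustrative.

```python
import numpy as np

a = np.array([1, 2])
b = np.array([2, 3])

difference = a - b   # element-wise subtraction
product = a * b      # element-wise multiplication, not a dot product
scaled = a * 10      # the scalar is applied to every element

print(difference)    # [-1 -1]
print(product)       # [2 6]
print(scaled)        # [10 20]
```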

numpy arrays are fast at these element-wise operations because every element of an array has the same type.

If we include a string in an array of integers, every element is converted to a string.


In [138]:
arr = np.array([1, "2", 3])
arr


Out[138]:
array(['1', '2', '3'],
      dtype='<U11')

Because the types no longer match, we cannot do mathematical operations between the integer array and the string array.


In [139]:
array + arr


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-139-3c38d381eb25> in <module>()
----> 1 array + arr

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U11') dtype('<U11') dtype('<U11')

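Converting the string to an integer before creating the array keeps the array numeric.


In [140]:
arr = np.array([1, int("2"), 3])  # np.int was removed from numpy; the built-in int does the same job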
arr


Out[140]:
array([1, 2, 3])

In [141]:
array + arr


Out[141]:
array([2, 4, 6])
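If the array has already been created with strings in it, one possible alternative is to convert the whole array afterwards with astype. This is only a sketch, and it assumes every element can be parsed as an integer; the names strings and numbers are illustrative.

```python
import numpy as np

strings = np.array([1, "2", 3])   # every element is stored as a string
numbers = strings.astype(int)     # parse each string as an integer

print(numbers)                         # [1 2 3]
print(numbers + np.array([1, 2, 3]))   # [2 4 6]
```

If any element cannot be parsed as an integer, astype raises a ValueError.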

Multidimensional Arrays

Because ndarray stands for n-dimensional array, numpy arrays can have more than one dimension.


In [142]:
array2d = np.array([[1,2, 3], [4, 5, 6]])
array2d


Out[142]:
array([[1, 2, 3],
       [4, 5, 6]])

In [143]:
array4x2 = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])  # still two-dimensional: 4 rows, 2 columns
array4x2


Out[143]:
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])
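The number of dimensions and the size of each dimension can be checked with the ndim and shape attributes. The sketch below rebuilds the two arrays above under the illustrative names two_by_three and four_by_two. Note that the second array has four rows but is still two-dimensional.

```python
import numpy as np

two_by_three = np.array([[1, 2, 3], [4, 5, 6]])
four_by_two = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

print(two_by_three.ndim, two_by_three.shape)  # 2 (2, 3)
print(four_by_two.ndim, four_by_two.shape)    # 2 (4, 2)
```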

Array Indexing

Similar to other languages, numpy allows you to select an item from an array by index.


In [144]:
print(array)
array[1]


[1 2 3]
Out[144]:
2

Python allows a negative index to count backwards from the end of the array. Forward indexes are zero-based, so the first element is at index 0. Negative indexes start at -1, which is the last element.


In [145]:
array[-1]


Out[145]:
3
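One way to think about a negative index is that it is added to the length of the array. A short sketch, with values as an illustrative name:

```python
import numpy as np

values = np.array([1, 2, 3])

last = values[-1]             # same element as values[len(values) - 1]
second_to_last = values[-2]   # same element as values[len(values) - 2]

print(last, second_to_last)   # 3 2
```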

Array Slicing

Python includes a syntax to slice arrays, that is, to get a sub-group of their elements, with the : notation.

The general syntax is array[startIndex:stopIndex:step]. The startIndex is inclusive and the stopIndex is exclusive, so array[0:2] returns the elements at indexes 0 and 1.


In [146]:
array[0:2:1]


Out[146]:
array([1, 2])

The step can be omitted, in which case it defaults to 1.


In [147]:
array[0:2]


Out[147]:
array([1, 2])

The startIndex can be omitted to indicate it starts from the beginning of the array.


In [148]:
array[:2]


Out[148]:
array([1, 2])

Likewise, the stopIndex can be omitted to indicate that the slice runs to the end of the array.


In [149]:
array[1:]


Out[149]:
array([2, 3])
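A step other than 1 is easier to see on a slightly longer array. The sketch below uses an illustrative five-element array named values. It also shows that the element at the stop index is not part of the result.

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50])

middle = values[1:4]         # indexes 1, 2 and 3; index 4 is not included
every_second = values[::2]   # start to end, taking every second element
backwards = values[::-1]     # a negative step walks the array in reverse

print(middle)         # [20 30 40]
print(every_second)   # [10 30 50]
print(backwards)      # [50 40 30 20 10]
```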

Slicing also works with multi-dimensional arrays. The slices for each dimension are separated by a comma (,).


In [150]:
print(array2d)


[[1 2 3]
 [4 5 6]]

In [151]:
array2d[:, 0] # Return elements in all rows and only the first column


Out[151]:
array([1, 4])

In [152]:
array2d[0, :] # Return elements in the first row and all columns


Out[152]:
array([1, 2, 3])

In [153]:
array2d[1, 1] # Return the element in the second row and the second column


Out[153]:
5
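Ranges and negative indexes can be combined across dimensions in the same way. A small sketch on the same data, with grid, block and corner as illustrative names:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])

block = grid[:, 1:]    # all rows, columns from index 1 onwards
corner = grid[0, -1]   # first row, last column

print(block)
# [[2 3]
#  [5 6]]
print(corner)          # 3
```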

Random Methods

Some of the most frequently used functions in numpy generate random data.

The randint function comes up a lot. When it is called with a single positional argument, that argument is the exclusive upper bound and the lower bound is 0, so np.random.randint(5) picks from 0 through 4. The size parameter sets the shape of the returned array.


In [168]:
np.random.randint(5, size=2)


Out[168]:
array([3, 3])
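randint also accepts a lower bound as its first argument, and size can be a tuple to get a multi-dimensional result. The sketch below uses the illustrative names rolls and grid. No output is shown because the values are random.

```python
import numpy as np

# Ten integers from 1 to 6: the low bound is inclusive, the high bound exclusive.
rolls = np.random.randint(1, 7, size=10)

# A 2x3 array of integers from 0 to 4.
grid = np.random.randint(5, size=(2, 3))

print(rolls.shape)   # (10,)
print(grid.shape)    # (2, 3)
```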

The seed method seeds the random generator to allow for reproducible results.


In [166]:
np.random.seed(0)
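To see what reproducible means here, the sketch below seeds the generator, draws some numbers, then seeds it again with the same value and draws again. The names first and second are illustrative.

```python
import numpy as np

np.random.seed(0)
first = np.random.randint(5, size=3)

np.random.seed(0)                      # reset to the same starting state
second = np.random.randint(5, size=3)

print(np.array_equal(first, second))   # True
```

Both draws produce the same numbers because the generator starts from the same state each time.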